Estimating Collection Size in Distributed Search

Authors

  • Jingfang Xu
  • Sheng Wu
  • Xing Li
Abstract

Distributed search is an effective way to search the information held in thousands of collections available on the web. Collection size is an important feature in distributed search and plays a vital role in resource representation and selection. This paper proposes two novel algorithms for estimating collection size in uncooperative environments. The sample high frequent resample (SHFRS) algorithm first samples collections with random queries and then resamples with the highest-frequency queries in the sample sets. The heterogeneous capture (HC) algorithm accounts for capture probabilities that differ across documents and estimates collection size by conditional maximum likelihood. Both algorithms are evaluated on real web data. Experimental results show that they significantly outperform both the sample-resample and capture-recapture algorithms.
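
For orientation, the two baselines named in the abstract can be sketched in a few lines. This is a minimal illustration of the textbook forms only: the two-sample (Lincoln-Petersen) capture-recapture estimate and the proportional sample-resample estimate. It is not the SHFRS or HC algorithm, whose details the abstract does not give. The function names, parameter names, and toy numbers are assumptions made for this sketch.

```python
def capture_recapture(sample_a, sample_b):
    """Two-sample (Lincoln-Petersen) estimate of collection size.

    sample_a, sample_b: iterables of document ids drawn independently
    from the same collection (hypothetical inputs for illustration).
    """
    first = set(sample_a)
    second = set(sample_b)
    overlap = len(first & second)
    if overlap == 0:
        # With no recaptured documents the estimate is undefined.
        raise ValueError("samples share no documents")
    return len(first) * len(second) / overlap


def sample_resample(sample_size, sample_df, collection_df):
    """Proportional sample-resample estimate of collection size.

    Assumes the fraction of sampled documents containing a probe term
    (sample_df / sample_size) matches the fraction in the full
    collection (collection_df / N), then solves for N.
    """
    if sample_df == 0:
        raise ValueError("probe term does not occur in the sample")
    return collection_df * sample_size / sample_df


if __name__ == "__main__":
    # Toy data: two samples of 100 ids that share 50 documents.
    print(capture_recapture(range(0, 100), range(50, 150)))
    # Toy data: a term in 30 of 300 sampled documents, and the
    # collection reports 1000 matching documents for that term.
    print(sample_resample(300, 30, 1000))
```

Both estimates assume every document is equally likely to be sampled. The abstract indicates that the HC algorithm is motivated by relaxing exactly this assumption.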

Similar resources

How Much Data Resides in a Web Collection: How to Estimate Size of a Web Collection

With the increasing amount of data in deep web sources (hidden from general search engines behind web forms), accessing this data has gained more attention. In the algorithms applied for this purpose, knowledge of a data source's size is what enables accurate decisions about when to stop crawling or sampling processes, which can be very costly in some cases [4]. The tendency to kn...

Distributed Generation Effects on Unbalanced Distribution Network Losses Considering Cost and Security Indices

Due to the increasing interest in renewable sources in recent years, studies on the integration of distributed generation into the power grid have rapidly increased. In order to minimize line losses of power systems, it is crucially important to define the size and location of local generation to be placed. Minimizing the losses in the system would bring two types of saving, in real life, one is ...

Estimating the size of search trees by sampling with domain knowledge

We show how recently-defined abstract models of the Branch & Bound algorithm can be used to obtain information on how the nodes are distributed in B&B search trees. This can be directly exploited in the form of probabilities in a sampling algorithm given by Knuth that estimates the size of a search tree. This method reduces the offline estimation error by a factor of two on search trees from Mi...

Estimating Size of Search Engines in an Uncooperative Environment

The number of documents that are indexed by a search engine is referred to as the size of the search engine. The information about the size of each underlying search engine is essential for any metasearch engine to conduct search engine selection, result merging and a few other processes. Thus, effectively estimating the size of search engines is important for a metasearch engine that incorpora...

Publication date: 2007